Learning to Order Terms: Supervised Interestingness Measures in Terminology Extraction
Authors
Abstract
Term Extraction, a key data preparation step in Text Mining, extracts terms, i.e. relevant collocations of words attached to specific concepts (e.g. genetic-algorithms and decision-trees are terms associated with the concept “Machine Learning”). In this paper, the task of extracting interesting collocations is achieved through a supervised learning algorithm that exploits a few collocations manually labelled as interesting/not interesting. From these examples, the ROGER algorithm learns a numerical function that induces a ranking on the collocations. This ranking is optimized using genetic algorithms, maximizing the Area Under the ROC Curve, which summarizes the trade-off between the true positive and false positive rates. The approach represents each word collocation as the vector of values of the standard statistical interestingness measures attached to it. As this representation is general (across corpora and natural languages), generality tests were performed by applying the ranking function learned from an English corpus in Biology to a French corpus of Curriculum Vitae, and vice versa, showing good robustness of the approach compared to the state-of-the-art Support Vector Machine (SVM).
Keywords: Text Mining, Terminology Extraction, Evolutionary Algorithm, ROC Curve.
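The idea of evolving a ranking function that maximizes the Area Under the ROC Curve can be illustrated with a minimal sketch. It is not the ROGER algorithm itself, whose details are not given here. The sketch assumes a linear scoring function over each collocation's vector of interestingness measures and a simple (1+1) evolution strategy. The names `auc`, `score` and `evolve_weights`, and the parameters `generations` and `sigma`, are hypothetical.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve, computed as the Wilcoxon-Mann-Whitney
    statistic: the fraction of (positive, negative) pairs in which the
    positive example is ranked higher; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def score(weights, x):
    """Assumed linear ranking function over the measure vector x."""
    return sum(w * v for w, v in zip(weights, x))

def evolve_weights(X, y, generations=200, sigma=0.3, seed=0):
    """(1+1) evolution strategy: mutate the weights with Gaussian noise
    and keep the child whenever its AUC is at least as good."""
    rng = random.Random(seed)
    best = [rng.gauss(0, 1) for _ in range(len(X[0]))]
    best_auc = auc([score(best, x) for x in X], y)
    for _ in range(generations):
        child = [w + rng.gauss(0, sigma) for w in best]
        child_auc = auc([score(child, x) for x in X], y)
        if child_auc >= best_auc:
            best, best_auc = child, child_auc
    return best, best_auc

# Toy data: each row stands for one collocation described by two
# made-up measure values; label 1 = interesting, 0 = not interesting.
X = [[1.0, 0.2], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
y = [1, 1, 0, 0]
weights, train_auc = evolve_weights(X, y)
```

Because the learned weights only depend on the measure values, not on the words themselves, the same `weights` can in principle be applied to collocations from another corpus, which is the kind of transfer the generality tests examine.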
Similar resources
Learning Interestingness Measures in Terminology Extraction. A ROC-based approach
In the field of Text Mining, a key phase in data preparation is concerned with the extraction of terms, i.e. collocation of words attached to specific concepts (e.g. Philosophy-Dissertation). In this paper, Term Extraction is formalized as a supervised learning task, extracting a ranking hypothesis from a set of terms labeled as relevant/irrelevant by the expert. This task is tackled using the ...
Preference Learning in Terminology Extraction: A ROC-based approach
A key data preparation step in Text Mining, Term Extraction selects the terms, or collocation of words, attached to specific concepts. In this paper, the task of extracting relevant collocations is achieved through a supervised learning algorithm, exploiting a few collocations manually labelled as relevant/irrelevant. The candidate terms are described along 13 standard statistical criteria meas...
Integrating Query Performance Prediction in Term Scoring for Diachronic Thesaurus
A diachronic thesaurus is a lexical resource that aims to map between modern terms and their semantically related terms in earlier periods. In this paper, we investigate the task of collecting a list of relevant modern target terms for a domain-specific diachronic thesaurus. We propose a supervised learning scheme, which integrates features from two closely related fields: Terminology Extractio...
A Domain Independent Approach for Extracting Terms from Research Papers
We study the problem of extracting terms from research papers, which is an important step towards building knowledge graphs in research domain. Existing terminology extraction approaches are mostly domain dependent. They use domain specific linguistic rules, supervised machine learning techniques or a combination of the two to extract the terms. Using domain knowledge requires much human effort...
UDLAP: Sentiment Analysis Using a Graph-Based Representation
We present an approach for tackling the Sentiment Analysis problem in SemEval 2015. The approach is based on the use of a cooccurrence graph to represent existing relationships among terms in a document with the aim of using centrality measures to extract the most representative words that express the sentiment. These words are then used in a supervised learning algorithm as features to obtain ...